OCPBUGS-32467: nodepool_controller: add a reconciler for cleanup #3969
Conversation
@stevekuznetsov: This pull request references Jira Issue OCPBUGS-32467, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request. The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
```go
	return ctrl.Result{}, err
}

targetPayloadConfigHash := supportutil.HashSimple(config + targetVersion + pullSecretName + globalConfig)
```
ideally this would be forced by a function signature or similar since we are using this formula on creation as well, otherwise there's a divergence risk
Yeah, was hoping there would be a suggestion on this. They're all strings so even a signature ends up being stringly typed - WDYT?
Yeah, not ideal, but I guess having a signature is still slightly better: say you're modifying the code in the main loop and you add a new parameter; the binary won't compile unless you pass the same number of parameters here. But then, yes, you could still satisfy it by passing the wrong string. Maybe start with a signature and then refactor a bit towards having a single `func targetHash(nodePool) string`?
`func targetHash(nodePool) string` is not really possible, since getting all the data for it requires a lot of other actions. Even if we had it hang off of the reconciler as a method, the main reconciliation method ends up using a lot of the intermediate calculated steps in other parts of the method, so it would be something like `func (r *NodePoolReconciler) targetHash(nodePool) (expectedCoreConfigResources, releaseImage, ... hash)`. I can do that, but it seemed awkward.
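One way to reduce the divergence risk discussed above is to funnel both the creation path and the cleanup path through a single typed input struct, so that adding a new hash input breaks every caller at compile time instead of silently diverging. A minimal sketch, where `hashInputs` and `payloadConfigHash` are hypothetical names and `sha256` stands in for `supportutil.HashSimple` (the point is the signature, not the hash function):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// hashInputs gathers every value that feeds the payload config hash.
// Using a struct rather than positional strings means the creation
// and cleanup paths must build the same set of fields; adding a field
// here forces every call site to be revisited at compile time.
type hashInputs struct {
	Config         string
	TargetVersion  string
	PullSecretName string
	GlobalConfig   string
}

// payloadConfigHash is the single place the hash formula lives.
func payloadConfigHash(in hashInputs) string {
	sum := sha256.Sum256([]byte(in.Config + in.TargetVersion + in.PullSecretName + in.GlobalConfig))
	return fmt.Sprintf("%x", sum[:4]) // short suffix, as used in resource names
}
```

This does not eliminate the "stringly typed" problem (a caller can still assign the wrong string to the wrong field), but named fields make that mistake much harder to make silently than with four positional string arguments.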
It's not clear to me what would cause this issue:

1. How would the HO upgrade cause this? Can you articulate this?
2. The PR desc says

   This is a valid use case, and it should be already covered by dropping this todo now https://github.com/openshift/hypershift/blob/main/hypershift-operator/controllers/nodepool/nodepool_controller.go#L780-L793 for the user data, and by https://github.com/openshift/hypershift/blob/main/ignition-server/controllers/tokensecret_controller.go#L159-L170 for expired tokens.
3. The PR desc says

   This I can see as the main valid use case for this pruner.
@enxebre I think for 2) the code you linked is still susceptible to leaks, since you'd have to ensure the cleanup happened before the NodePool changed again, and we don't have a transactional way to do that.
```go
	return ctrl.Result{}, nil
}

log.WithValues("options", names, "valid", valid).Info("removing secret as it does not match the expected set of names")
```
By deleting the token secret synchronously here there's a risk we break in-flight booting instances that are about to run ignition. That's in part why we set it for expiration here:

hypershift/hypershift-operator/controllers/nodepool/nodepool_controller.go, lines 772 to 776 in 6bb55df:

```go
if err == nil {
	if err := setExpirationTimestampOnToken(ctx, r.Client, tokenSecret); err != nil && !apierrors.IsNotFound(err) {
		return ctrl.Result{}, fmt.Errorf("failed to set expiration on token Secret: %w", err)
	}
}
```
cc @relyt0925
@enxebre in order to annotate it correctly, we need to know that the secret we're reconciling is a token secret. Do you want us to add something machine-readable to identify them, or is some string prefix check on the name sufficiently safe?
@enxebre took a stab at making it harder to accidentally change one hash and not both, and updated to mark tokens for expiry.
Thanks! Dropped a few more comments, lgtm otherwise. Let's run through IBM folks afterwards.
@enxebre updated for your comments - please let me know if you find the comment in #3969 (comment) convincing.
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: enxebre, stevekuznetsov

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
/hold Revision b071aaf was retested 3 times: holding
/retest

/hold cancel

/retest

/override "Red Hat Konflux / hypershift-operator-main-on-pull-request"
@csrwng: Overrode contexts on behalf of csrwng: Red Hat Konflux / hypershift-operator-main-on-pull-request In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/retest
When we change the config for a NodePool or the way in which we hash the values, we leak token and user data secrets. However, since these secrets are annotated with the NodePool they were created for, it's simple to check that the Secret still matches what we expect and clean it up otherwise. Signed-off-by: Steve Kuznetsov <[email protected]>
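The cleanup check described in the commit message can be sketched as set membership over expected secret names: recompute the names the NodePool should currently own, then treat any secret annotated as belonging to that NodePool but outside the expected set as leaked. The naming scheme below is illustrative; the real controller derives names from the NodePool and its computed hashes:

```go
package main

import "strings"

// validNames is the set of secret names the NodePool currently expects:
// one token and one user-data secret per payload config hash.
// (Naming scheme is illustrative, not the actual hypershift convention.)
func validNames(nodePool string, hashes []string) map[string]bool {
	valid := map[string]bool{}
	for _, h := range hashes {
		valid["token-"+nodePool+"-"+h] = true
		valid["user-data-"+nodePool+"-"+h] = true
	}
	return valid
}

// shouldCleanUp reports whether a secret attributed to this NodePool no
// longer matches any expected name, i.e. it was leaked by a config or
// hash-formula change and can be removed (or marked for expiration).
func shouldCleanUp(secretName, nodePool string, valid map[string]bool) bool {
	if !strings.Contains(secretName, nodePool) {
		return false // not attributed to this NodePool; leave it alone
	}
	return !valid[secretName]
}
```

Because the expected set is recomputed from the NodePool spec on every reconcile, this check stays correct even when the hash formula itself changes: old secrets simply stop matching and fall out of the valid set.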
/lgtm
@stevekuznetsov: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Merged ca90be6 into openshift:main
@stevekuznetsov: Jira Issue OCPBUGS-32467: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-32467 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
@stevekuznetsov: #3969 failed to apply on top of branch "release-4.16":
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
We currently need to keep old userdata secrets in AWS < 4.16 to prevent MachineDeployment rollouts from failing to delete old Machines. https://github.com/openshift/hypershift/blob/3efa23b932a59681346a7a432d349cfb6e44b13d/hypershift-operator/controllers/nodepool/nodepool_controller.go#L775-L777 kubernetes-sigs/cluster-api-provider-aws#3805 We regressed that behaviour in openshift#3969. This PR fixes that by statically checking the HC release version.
/assign @sjenning